Abstract

This  paper  describes  the  MIT-LL/AFRL  statistical  MT 
system  and  the  improvements  that  were  developed  during  the 
IWSLT  2011  evaluation  campaign.  As  part  of  these  efforts, 
we  experimented  with  a  number  of  extensions  to  the  standard 
phrase-based  model  that  improve  performance  on  the  Arabic 
to English and English to French TED-talk translation tasks.
We  also  applied  our  existing  ASR  system  to  the  TED-talk 
lecture  ASR  task. 

We  discuss  the  architecture  of  the  MIT-LL/AFRL  MT 
system,  improvements  over  our  2010  system,  and  experi¬ 
ments  we  ran  during  the  IWSLT-2011  evaluation.  Specifi¬ 
cally,  we  focus  on  1)  speech  recognition  for  lecture-like  data, 
2)  cross-domain  translation  using  MAP  adaptation,  and  3) 
improved  Arabic  morphology  for  MT  preprocessing. 

1.  Introduction 

During the evaluation campaign for the 2011 International
Workshop  on  Spoken  Language  Translation  (IWSLT-2011) 
our  experimental  efforts  centered  on  1)  speech  recognition 
for  lecture-like  data,  2)  cross-domain  translation  using  MAP 
adaptation,  and  3)  improved  Arabic  morphology  for  MT  pre¬ 
processing. 

In  this  paper  we  describe  improvements  over  our  2010 
baseline  systems  and  methods  we  used  to  combine  outputs 
from multiple systems. For a fuller description of the
2010 baseline system, refer to [1].

The  remainder  of  this  paper  is  structured  as  follows.  In 
section  2,  we  present  an  overview  of  our  baseline  system  and 
the  minor  improvements  to  this  standard  statistical  MT  ar¬ 
chitecture  that  we  developed.  In  sections  3, 4,  and  5  we  de¬ 
scribe  experiments  for  cross-domain  adaptation,  better  Turk¬ 
ish  and  Arabic  morphological  processing,  improved  handling 
of speech input and our implementation of MT system combination.
Section 7 describes the systems we submitted for
this year’s evaluation and their results.

This work is sponsored by the Air Force Research Laboratory under
Air Force contract FA8721-05-C-0002. Opinions, interpretations, conclusions
and recommendations are those of the authors and are not necessarily
endorsed by the United States Government.

1.1.  IWSLT-2011  Data  Usage 

We  submitted  systems  for  the  ASR  task  and  English-to- 
French  and  Arabic-to-English  MT  tasks.  In  each  case,  we 
used  data  supplied  by  the  evaluation  for  each  language  pair 
for  training  and  optimization.  For  English-to-French  sys¬ 
tems,  data  from  Gigaword  and  Europarl  corpora  were  used 
for  both  language  model  and  phrase  model  training.  For  Ara¬ 
bic,  our  systems  were  strictly  limited  to  the  TED  training 
supplied  by  the  evaluation. 

For  cross-domain  adaptation  experiments  conducted  on 
the  English-to-French  data  sets,  the  TED  training  data  was 
used  to  adapt  these  initial  models  to  the  TED  domain(s). 

We  employ  a  minimum  error  rate  training  process  to  op¬ 
timize  model  parameters  with  a  held-out  development  set 
(dev2010). The resulting models and optimization parameters
can then be applied to test data during the decoding and
rescoring  phases  of  the  translation  process. 

2.  Baseline  MT  System 

Our  baseline  system  implements  a  fairly  standard  SMT  archi¬ 
tecture  allowing  for  training  of  a  variety  of  word  alignment 
types  and  rescoring  models.  It  has  been  applied  successfully 
to  a  number  of  different  translation  tasks  in  prior  work,  in¬ 
cluding  prior  IWSLT  evaluations.  The  training/decoding  pro¬ 
cedure  for  our  system  is  outlined  in  Table  1.  Details  of  the 
training  procedure  are  described  in  [7]. 

2.1.  Phrase  Table  Training 

To  maximize  phrase  table  coverage,  we  combine  multiple 
word  alignment  strategies,  extending  the  method  described 
in  [8].  For  all  language  pairs,  we  combine  alignments  from 
IBM  model  5  (see  [11]  and  [12])  with  alignments  extracted 
using  the  competitive  linking  algorithm  (CLA)  described 
in [9] and the Berkeley Aligner [10]. Phrases were extracted
from each alignment type and combined in one phrase
table. This was done by summing counts of phrases extracted
from the alignment types before computing the relative
frequencies used in our phrase tables.

Training Process

1. Segment training corpus

2. Compute GIZA++, Berkeley and Competitive Linking
Alignments (CLA) for segmented data [8] [9] [10]

3. Extract phrases for all variants of the training corpus

4. Split word-segmented phrases into characters

5. Combine phrase counts and normalize

6. Train language models from the training corpus

7. Train TrueCase models

8. Train source language repunctuation models

Decoding/Rescoring Process

1. Decode input sentences using base models

2. Add rescoring features (e.g. IBM model-1 score, etc.)

3. Merge N-best lists (if input is ASR N-best)

4. Rerank N-best list entries

Table 1: Training/decoding structure
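The count-summing step of Section 2.1 can be illustrated with a minimal sketch. It assumes each alignment type yields a dictionary of phrase-pair counts keyed by (source, target); the function name, the toy phrase pairs, and the direction of normalization are illustrative assumptions rather than details of our implementation.

```python
from collections import Counter, defaultdict

def combine_phrase_tables(count_tables):
    """Sum phrase-pair counts across alignment types, then normalize."""
    total = Counter()
    for counts in count_tables:
        total.update(counts)  # adds the counts of matching (src, tgt) keys
    src_totals = defaultdict(float)
    for (src, _tgt), n in total.items():
        src_totals[src] += n
    # Relative frequency p(tgt | src), computed from the summed counts.
    return {(src, tgt): n / src_totals[src] for (src, tgt), n in total.items()}

# Hypothetical counts from three alignment types.
giza = {("la maison", "the house"): 3, ("la maison", "the home"): 1}
berkeley = {("la maison", "the house"): 2}
cla = {("la maison", "house"): 2}

table = combine_phrase_tables([giza, berkeley, cla])
```

Summing before normalizing lets a phrase pair found by only one aligner still enter the table, while pairs supported by several aligners receive proportionally more probability mass.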

2.2.  Language  Model  Training 

During  the  training  process  we  built  n-gram  language  models 
for  use  in  decoding/rescoring,  TrueCasing  and  repunctuation. 
In  all  cases,  the  MIT  Language  Modeling  Toolkit  [13]  was 
used to create interpolated Kneser-Ney LMs. Additional
class-based  language  models  were  also  trained  for  rescoring. 
Some  systems  made  use  of  3-  and  7-gram  language  models 
for  rescoring  trained  on  the  target  side  of  the  parallel  text. 

2.3.  Optimization,  Decoding,  and  Rescoring 

Our  translation  model  assumes  a  log-linear  combination  of 
phrase  translation  models,  language  models,  etc. 

$$\log P(E \mid F) \propto \sum_{\forall r} \lambda_r h_r(E, F)$$

To optimize system performance we train scaling factors,
$\lambda_r$, for both decoding and rescoring features so as to
minimize an objective error criterion. This is done using a
standard Powell-like grid search performed on a development
set [14].

In addition to the Powell-based approach, a number of
our systems used the MIRA algorithm for weight optimization
[23, 22, 24]. In this approach, weights are optimized
subject to a maximum margin constraint in an online fashion.
The equation below shows the update procedure for weights
$\mathbf{w}_i$ corresponding to the $i$th online iteration of the algorithm.

$$\mathbf{w}_i = \mathbf{w}_{i-1} + \alpha \cdot \left(\mathbf{h}(f, \hat{e}) - \mathbf{h}(f, e')\right)$$

where $\hat{e}$ denotes the oracle translation for a source sentence
$f$, $\mathbf{h}(f, e)$ is a vector of model scores corresponding to the
translation of $f$ into $e$, and $\alpha$ is an update scaling parameter
defined as follows:

$$\alpha = \max\left(0, \min\left(C, \frac{\mathcal{L}(\hat{e}, e') - \left(s_{i-1}(f, \hat{e}) - s_{i-1}(f, e')\right)}{\left\|\mathbf{h}(f, \hat{e}) - \mathbf{h}(f, e')\right\|}\right)\right)$$

$$s_{i-1}(f, e) = \mathbf{w}_{i-1} \cdot \mathbf{h}(f, e)$$

$\mathcal{L}(\hat{e}, e')$ defines a loss function (in our case, the BLEU score
difference between the oracle translation, $\hat{e}$, and the current
best translation, $e'$). $C$ is a limiter on the update scaling. It is
easy to see that the update size at each iteration is proportional to
the difference between the loss value and the predicted score
margin.

Weights $\mathbf{w}_i$ are updated sentence by sentence (order of
presentation is randomized) until either a convergence criterion
is met or a limit on the number of iterations is reached.
Our implementation of MIRA follows the procedure in [23]
for oracle selection and scoring.

A full list of the independent model parameters that we
used in our baseline system is shown in Table 2. All systems
generated N-best lists that are then rescored and reranked using
either a ML or an MBR (Minimum Bayes Risk) criterion.

Decoding Features

P(f|e)
P(e|f)
LexW(f|e)
LexW(e|f)
Phrase Penalty
Lexical Backoff
Word Penalty
Distortion
P(E) - 6-gram language model

Rescoring Features

P_rescore(E) - 7-gram LM
P_class(E) - 7-gram class-based LM
P_Model1(F|E) - IBM model 1 translation probabilities

Table 2: Independent models used in log-linear combination


These  model  parameters  are  similar  to  those  used  by 
other  phrase-based  systems.  For  IWSLT,  we  also  add  source- 
target  word  translation  pairs  to  the  phrase  table  that  would 
not  have  been  extracted  by  the  standard  phrase  extraction 
heuristic  from  IBM  model  5  word  alignments.  These  phrases 
have  an  additional  lexical  backoff  penalty  that  is  optimized 
during  minimum  error  rate  training. 
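To make the log-linear combination of Section 2.3 concrete, the sketch below scores and reranks a toy N-best list. The feature names, feature values, and weights are invented for illustration, and the lexical backoff feature is treated as a simple count of backoff entries used, which is an assumption about its form.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: sum_r lambda_r * h_r(E, F)."""
    return sum(weights[name] * value for name, value in features.items())

def rerank(nbest, weights):
    """Sort hypotheses from highest to lowest log-linear score."""
    return sorted(
        nbest,
        key=lambda hyp: loglinear_score(hyp["features"], weights),
        reverse=True,
    )

# Hypothetical weights (lambda_r) and N-best entries.
weights = {"lm": 0.5, "tm": 0.3, "word_penalty": -0.1, "lexical_backoff": -0.2}
nbest = [
    {"text": "the home is small",
     "features": {"lm": -5.0, "tm": -1.5, "word_penalty": 4, "lexical_backoff": 1}},
    {"text": "the house is small",
     "features": {"lm": -4.0, "tm": -2.0, "word_penalty": 4, "lexical_backoff": 0}},
]

best = rerank(nbest, weights)[0]
```

Because the backoff penalty is just another weighted feature, its weight can be tuned together with the other scaling factors during optimization.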

This  system  serves  as  the  basis  for  a  number  of  the 
contrastive  systems  submitted  during  this  year’s  evaluation. 
Contrastive  systems  differ  in  terms  of  their  rescoring  con¬ 
figuration  (e.g.  language  models,  MBR)  and  the  data  used 
to  train  them  (some  systems  made  use  of  additional  lexicon 
data).  Each  of  the  contrastive  systems  was  used  as  a  com¬ 
ponent  for  system  combination.  The  combined  output  for 
each  of  the  Turkish-to-English  and  Arabic-to-English  tasks 
was  submitted  as  our  primary  system.  Detailed  differences 
of  each  submitted  system  can  be  found  in  section  8. 


The Moses decoder [15] was used for our baseline system.
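The MIRA weight update described in Section 2.3 can be sketched for a single sentence as follows. This is a minimal illustration under stated assumptions: feature vectors are plain lists, the loss is supplied by the caller, and the denominator uses the norm as written in the text (many published MIRA formulations use the squared norm instead). The function names are hypothetical.

```python
import math

def dot(w, h):
    """Model score s(f, e) = w . h(f, e)."""
    return sum(wi * hi for wi, hi in zip(w, h))

def mira_update(w, h_oracle, h_best, loss, C):
    """One online update of the weights toward the oracle translation."""
    diff = [ho - hb for ho, hb in zip(h_oracle, h_best)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        return list(w)  # identical feature vectors: nothing to learn from
    margin = dot(w, h_oracle) - dot(w, h_best)
    # Step size grows with (loss - margin) and is clipped to [0, C].
    alpha = max(0.0, min(C, (loss - margin) / norm))
    return [wi + alpha * d for wi, d in zip(w, diff)]

# Toy example: the oracle and the current best differ in two features.
w_new = mira_update([0.0, 0.0], h_oracle=[1.0, 0.0], h_best=[0.0, 1.0],
                    loss=0.5, C=1.0)
```

When the model already separates the oracle from the current best by more than the loss, the step size is zero and the weights are left unchanged.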


3.  Automatic  Speech  Recognition 

Acoustic  training  data  for  our  ASR  system  were  harvested 
from  TED.  We  downloaded  807  TED  Talks  that  were 
recorded prior to 2011, and used FFmpeg to extract 16 kHz
audio from each video file. Word alignments were automatically
generated for each talk using an HTK HMM system
that was trained on the HUB4 English Broadcast News corpora
[1, 2]. Long periods of non-speech were removed, and
each  talk  was  split  into  utterances  shorter  than  20  seconds. 
Next,  closed  caption  filtering  [3]  was  applied  to  sequester 
utterances  that  may  include  transcription  errors.  Each  talk 
was  decoded  using  the  HUB4  HMMs  and  a  Language  Model 
(LM)  that  was  estimated  from  the  transcript  for  the  talk.  The 
recognizer  outputs  were  compared  to  the  transcripts,  and  a 
data  partition  was  created  using  all  utterances  with  a  Word 
Error  Rate  (WER)  less  than  20%.  This  process  yielded  164 
hours  of  audio. 
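The selection step above can be sketched as a word-level edit distance followed by a threshold test. The code assumes whitespace-tokenized transcripts and recognizer outputs; the function names and sample utterances are illustrative only and do not reproduce the closed caption filtering tool of [3].

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    if not ref:
        return float(bool(hyp))
    return prev[-1] / len(ref)

def select_utterances(pairs, max_wer=0.20):
    """Keep (transcript, recognizer output) pairs with WER below max_wer."""
    return [(ref, hyp) for ref, hyp in pairs if word_error_rate(ref, hyp) < max_wer]

pairs = [
    ("a b c d e f g h i j", "a b c d e f g h i x"),  # 1 error in 10 words
    ("hello world", "hello word"),                    # 1 error in 2 words
]
kept = select_utterances(pairs)
```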

An  HMM  system  was  trained  on  the  TED  acoustic  data 
using  HTK.  Phonemes  were  modeled  using  state-clustered 
cross-word  triphones,  and  the  final  HMM  set  included  6,000 
shared  states  with  an  average  of  28  mixtures  per  state.  The 
models  were  discriminatively  trained  using  the  Minimum 
Phone  Error  (MPE)  criterion.  The  feature  set  consisted  of 
12  Perceptual  Linear  Prediction  (PLP)  coefficients,  plus  the 
zeroth  coefficient,  with  mean  normalization  applied  on  a  per 
utterance  basis.  Delta,  acceleration,  and  third  differential  co¬ 
efficients  were  appended  to  form  a  52  dimensional  vector, 
and  Heteroscedastic  Linear  Discriminant  Analysis  (HLDA) 
was  applied  to  reduce  the  feature  dimension  to  39.  A  second 
set  of  models  was  estimated  that  included  Speaker  Adaptive 
Training  (SAT). 
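As a quick check of the feature dimensions stated above, the 12 PLP coefficients plus the zeroth coefficient give 13 static features, and appending three orders of differentials yields four blocks of 13:

```latex
\underbrace{(12 + 1)}_{\text{PLP} + c_0} \times
\underbrace{4}_{\text{static},\ \Delta,\ \Delta^2,\ \Delta^3} = 52
\quad \xrightarrow{\ \text{HLDA}\ } \quad 39
```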

Interpolated LMs were trained from the Europarl, News
Commentary, News 2007-2011, and the provided TED data.
Trigram and 4-gram LMs were estimated for decoding and
rescoring. The vocabulary included 95,000 words, and
pronunciations for words missing from the CMU dictionary
were generated using the Sequitur grapheme-to-phoneme system [4].

Decoding  was  performed  as  follows.  Initial  transcripts 
were  produced  using  HDecode  with  the  non-SAT  HMMs. 
Next,  the  MIT-LL  GMM  software  package  was  used  to  clus¬ 
ter  the  utterances  from  each  talk.  Constrained  Maximum 
Likelihood  Linear  Regression  (CMLLR)  transforms  were  es¬ 
timated  for  each  cluster,  and  recognition  lattices  were  gener¬ 
ated  using  the  SAT  HMMs.  The  final  transcripts  were  pro¬ 
duced  by  rescoring  the  lattices  with  the  4-gram  LM. 

4.  Cross  Domain  Adaptation 

During  this  evaluation  we  re-examined  the  approach  to  cross 
domain  adaptation  that  we  presented  in  last  year’s  evalua¬ 
tion [2]. To this end, we built a general purpose model for
the English-French task using training data from the Gigaword
French-English and Europarl corpora [5] for each language
respectively. These models were trained using over
700k  sentence  pairs  of  data.  Using  the  provided  training  data 
from  the  IWSLT  evaluation,  we  applied  a  variation  of  the 
MAP  phrase  table  adaptation  procedure  described  last  year, 
which  is  shown  in  the  equations  below: 


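$$\hat{p}(s \mid t) = \lambda \, p_{iwslt}(s \mid t) + (1 - \lambda) \, p_{gp}(s \mid t)$$

$$\lambda = \frac{N_{iwslt}(s, t)}{N_{iwslt}(s, t) + N_{gp}(s, t) + \tau} \cdot p_D(iwslt \mid dev)$$

where $p_{gp}$ and $p_{iwslt}$ are phrase probability estimates from
the general purpose and IWSLT-domain models respectively,
$N_{gp}(s, t)$ and $N_{iwslt}(s, t)$ are the corresponding phrase-pair
counts, and $p_D(iwslt \mid dev)$ is an estimate of the corpus posterior
given a dev set.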

During last year’s evaluation we used a strict MAP formulation,
i.e., the ratio of counts between iwslt and gp models
determines the weighting of the models. During this evaluation
we introduce a corpus posterior probability $p_D$, which
we approximate via dev set BLEU scores as follows:


$$p_D(iwslt \mid dev) = \frac{BLEU(dev \mid \lambda_{iwslt})}{\sum_{c} BLEU(dev \mid \lambda_c)}$$

where $BLEU(dev \mid \lambda_c)$ is simply the BLEU score for a dev
set $dev$ under a translation model $\lambda_c$. The idea here is to
incorporate  a  semantic  distance  measure  to  weight  the  con¬ 
tribution  of  phrase  counts  from  each  corpus. 
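A small numerical sketch may help fix the adaptation procedure of this section. It assumes that the interpolation weight is the relative in-domain count scaled by the corpus posterior, and that the posterior is each corpus's share of the dev-set BLEU total, as the surrounding text describes. The function names, the value of the smoothing constant `tau`, and all numbers are illustrative assumptions.

```python
def corpus_posterior(bleu_by_corpus, corpus):
    """Share of the total dev-set BLEU obtained by the given corpus's model."""
    return bleu_by_corpus[corpus] / sum(bleu_by_corpus.values())

def adapted_probability(p_iwslt, p_gp, n_iwslt, n_gp, tau, posterior):
    """Interpolate in-domain and general-purpose phrase probabilities."""
    lam = n_iwslt / (n_iwslt + n_gp + tau) * posterior
    return lam * p_iwslt + (1.0 - lam) * p_gp

# Hypothetical dev-set BLEU scores for the two models.
bleu = {"iwslt": 30.0, "gp": 20.0}
post = corpus_posterior(bleu, "iwslt")

# Hypothetical phrase pair seen 6 times in-domain and 3 times out-of-domain.
p = adapted_probability(p_iwslt=0.8, p_gp=0.2, n_iwslt=6, n_gp=3,
                        tau=1.0, posterior=post)
```

With these numbers the interpolation weight is 0.6 × 0.6 = 0.36, so the adapted estimate stays closer to the general-purpose probability than to the in-domain one.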

As  in  last  year’s  experiments,  phrase  table  adaptation  and 
language  model  interpolation  were  used  jointly  to  improve 
performance. 


5.  Arabic-specific  Morphological  Processing 

In  our  Arabic  systems  for  prior  year  evaluations  [4,  3,  2,  1], 
we  normalized  various  forms  of  alef  and  hamza  and  removed 
the  tatweel  character  and  some  diacritics  before  applying  a 
light  Arabic  morphological  analysis  procedure.  This  year, 
we  first  normalized  all  Unicode  Arabic  presentation  forms 
to their constituent isolated forms. For example, Unicode
\x{fef7} (called “Arabic Ligature Lam with Alef with
Hamza Above Isolated Form”) was normalized to Unicode
\x{0644}\x{0623} (i.e., “Arabic Letter Lam” in isolated
form followed by the isolated form for “Arabic Letter Alef
with Hamza Above”). Then, we performed our alef-hamza,
tatweel, and diacritic normalizations. At this time, we further
removed Unicode \x{0670}, “Arabic Letter Superscript
Alef,” and normalized Unicode \x{0671}, “Arabic
Letter Alef Wasla,” to Unicode \x{0627}, a “bare” alef.
After  these  normalizations,  we  converted  Arabic  digits  and 
the  Arabic  percent  sign,  decimal  separator,  and  thousands 
separator  to  their  English  equivalents  and  tokenized  the  punc¬ 
tuation. 
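The normalization chain above can be approximated with a short sketch. It uses Unicode NFKC normalization as a stand-in for the presentation-form mapping, and it assumes a particular set of alef-hamza variants (U+0622, U+0623, U+0625) and diacritics (U+064B to U+0652); the exact character sets we used are not listed here, so these are assumptions. Punctuation tokenization is omitted.

```python
import unicodedata

_ALEF = "\u0627"
_CHAR_MAP = {
    0x0622: _ALEF, 0x0623: _ALEF, 0x0625: _ALEF,  # assumed alef-hamza variants
    0x0671: _ALEF,   # alef wasla -> bare alef
    0x0640: None,    # tatweel removed
    0x0670: None,    # superscript alef removed
    0x066A: "%",     # Arabic percent sign
    0x066B: ".",     # Arabic decimal separator
    0x066C: ",",     # Arabic thousands separator
}
# Assumed diacritic range: fathatan (U+064B) through sukun (U+0652).
_CHAR_MAP.update({cp: None for cp in range(0x064B, 0x0653)})
# Arabic-Indic digits U+0660..U+0669 -> ASCII digits.
_CHAR_MAP.update({0x0660 + d: str(d) for d in range(10)})

def normalize_arabic(text):
    """Map presentation forms to base letters, then apply per-character rules."""
    text = unicodedata.normalize("NFKC", text)
    return text.translate(_CHAR_MAP)

ligature = normalize_arabic("\ufef7")         # lam-alef-hamza ligature
number = normalize_arabic("\u0661\u0662\u066a")  # Arabic-Indic "12%"
```

Since every rule after the NFKC step is a one-to-one character mapping or a deletion, applying them in a single `translate` pass gives the same result as applying them one after another.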

Columns: dev2010, tst2010

HUB4 HMMs
  1st pass: 24.8
  2nd pass: 21.2
  4-gram rescore: 22.6, 20.7
TED HMMs
  1st pass: 20.7
  2nd pass: 18.5
  4-gram rescore: 17.8

Table 3: WERs obtained on the IWSLT dev2010 and
tst2010 partitions using the HUB4 and TED HMMs

In our 2009 Arabic MT system [2], we employed a
modification of our earlier light morphological analysis process
that we called Count-Mediated Morphological Analysis
(CoMMA). The CoMMA process segments only those
tokens that occur in the training data fewer times than a
user-chosen  threshold.  Tokens  that  occur  at  least  as  many 
times  as  the  threshold  are  passed  through  to  the  output  unseg¬ 
mented.  For  this  year’s  Arabic  system,  we  again  employed 
the  CoMMA  process  and  developed  six  MT  systems  using 
the  CoMMA  process  at  thresholds  of  1,000,  2,000,  5,000, 
10,000,  20,000,  and  50,000.  For  each  of  these  six  threshold 
values,  the  best  system  in  terms  of  BLEU  score  (after  ten 
optimization  runs)  was  used  in  our  system  combination  with 
the  other  Arabic  MT  systems  that  we  developed. 
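The count-mediated decision in CoMMA can be sketched as follows. The `light_segment` function is a hypothetical stand-in for our light morphological analyzer: it strips a single prefix from a short, assumed list and operates on Buckwalter-style transliteration purely for readability. Only the thresholding logic reflects the description above.

```python
from collections import Counter

def light_segment(token):
    """Hypothetical light analyzer: split off one known prefix, if any."""
    for prefix in ("wAl", "Al", "w", "b", "l"):
        if token.startswith(prefix) and len(token) > len(prefix) + 1:
            return [prefix + "+", token[len(prefix):]]
    return [token]

def comma_segment(sentences, threshold):
    """Segment only tokens seen fewer than `threshold` times in the data."""
    counts = Counter(tok for sent in sentences for tok in sent)
    segmented = []
    for sent in sentences:
        out = []
        for tok in sent:
            if counts[tok] < threshold:
                out.extend(light_segment(tok))  # rare token: segment it
            else:
                out.append(tok)                 # frequent token: pass through
        segmented.append(out)
    return segmented

corpus = [["Alktb", "fy", "Albyt"], ["Alktb", "jdyd"]]
result = comma_segment(corpus, threshold=2)
```

Raising the threshold segments more of the vocabulary, which is why the systems built at thresholds from 1,000 to 50,000 differ in how many frequent tokens are left unsegmented.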

6.  ASR  Experiments 

Table  3  shows  the  WERs  obtained  on  the  IWSLT 
dev2010  and  tst2010  partitions.  For  comparison  pur¬ 
poses,  we  have  also  included  the  results  obtained  with  the 
HUB4  HMMs.  Note  that  non-SAT  HMMs  were  used  for 
both  passes  with  the  HUB4  system.  From  Table  3,  we  can 
see  that  training  on  the  TED  acoustic  data  yielded  a  substan¬ 
tial  improvement  in  WER  compared  to  the  HUB4  models. 

7.  MT  Experiments 

With  each  of  the  enhancements  presented  in  prior  sections, 
we  ran  a  number  of  development  experiments  in  preparation 
for  this  year’s  evaluation.  This  section  describes  the  develop¬ 
ment  data  that  was  used  for  each  evaluation  track,  and  results 
comparing  the  aforementioned  enhancements  with  our  base¬ 
line  system. 

7.1.  Development  Data 

Table 4 describes the development and training set configurations
used for each language pair in this year’s evaluation.

                                  Arabic       English
train     Sentences                     90,542
          Running words        1,235,359     1,477,768
          Avg. Sent. length        13.64         16.32
          Vocabulary              46,780        34,447
dev2010   Sentences                        934
          Running words
          Avg. Sent. length
tst2010   Sentences                        507
          Running words           23,080        26,786
          Avg. Sent. length        13.87         16.10

                                 English        French
train     Sentences                    107,268
          Running words        1,760,288     1,840,764
          Avg. Sent. length        16.41         17.16
          Vocabulary              41,466        53,997
dev2010   Sentences                        934
          Running words           17,451        17,043
          Avg. Sent. length        18.68         18.25
tst2010   Sentences                      1,664
          Running words           26,786        27,802
          Avg. Sent. length        16.10         16.71

Table 4: Corpus statistics for all language pairs

7.2. English-to-French Translation Baselines

We ran a number of baseline systems on the talk task data
set using the methods described in prior sections. We used
the WMT-supplied segmenters for preprocessing and normalization,
as well as in-house tokenizers for Arabic and
French. In addition to the IWSLT-supplied TED data, data
from the French Gigaword and Europarl corpora was used for
language/phrase modeling in the English-French task (our
Arabic-English system makes no use of non-TED data). In
order to perform development experiments, we used supplied
development data (dev2010) for optimization, and we held
out tst2010 for development testing. Table 5 summarizes
the results on the held-out tst2010 set. For these experiments,
the reported scores are an average of five optimization/decoding
runs with different random weight initializations.

Neither optimization strategy clearly outperforms the
other, though the addition of language models trained on
other corpora is a clear benefit (≈1.0-1.2 BLEU). During this
evaluation we employed a perplexity-minimizing interpolation
strategy: a single LM was constructed by interpolating
TED LMs with LMs trained on other corpora so as to minimize
perplexity on dev2010.
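One common way to find perplexity-minimizing interpolation weights is expectation-maximization over the mixture weights; the sketch below illustrates that idea and is not a description of the toolkit's internals. It assumes each component LM has already been evaluated on the dev set, so that `component_probs[k][t]` is the (strictly positive) probability LM `k` assigns to dev token `t`. All names and numbers are illustrative.

```python
import math

def em_interpolation_weights(component_probs, iterations=50):
    """Estimate linear interpolation weights that maximize dev-set likelihood."""
    k = len(component_probs)
    n = len(component_probs[0])
    weights = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for t in range(n):
            mix = sum(weights[j] * component_probs[j][t] for j in range(k))
            for j in range(k):
                # Posterior responsibility of LM j for token t.
                expected[j] += weights[j] * component_probs[j][t] / mix
        weights = [e / n for e in expected]
    return weights

def perplexity(component_probs, weights):
    """Perplexity of the interpolated model on the dev tokens."""
    n = len(component_probs[0])
    log_sum = sum(
        math.log(sum(w * p[t] for w, p in zip(weights, component_probs)))
        for t in range(n)
    )
    return math.exp(-log_sum / n)

# Hypothetical per-token probabilities from a TED LM and an out-of-domain LM.
ted = [0.2, 0.1, 0.3, 0.05]
news = [0.05, 0.2, 0.1, 0.1]
w = em_interpolation_weights([ted, news])
```

Each EM iteration cannot decrease the dev-set likelihood, so the resulting perplexity is never worse than that of the uniform starting weights.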

7.2.1.  English-French  Domain  Adaptation  Experiments 

As  described  in  section  4,  we  applied  a  different  formulation 
of  the  MAP-based  count-smoothing  approach  we  introduced 
during  last  year’s  evaluation  and,  in  this  year’s  evaluation, 
we  also  introduced  a  corpus-distance  factor.  We  conducted 
experiments  on  the  English-to-French  translation  task  using 
out-of-domain  data  from  Europarl  and  Gigaword  corpora  for 
backoff  when  in-domain  model  probabilities  are  poorly  esti¬ 
mated. 

Table  6  compares  the  English-to-French  IWSLT  baseline 
(optimized via a Powell search) against 1) the MAP adaptation
method we proposed last year and 2) MAP with a corpus
distance  factor  as  proposed  above. 


System                              Optimization Method   tst2010 (BLEU)
TED PT + TED LM                     MERT                  29.54
TED PT + TED LM                     MIRA                  29.12
TED PT + TED LM + additional LMs    MERT                  30.80
TED PT + TED LM + additional LMs    MIRA                  31.07

Table 5: Summary of baseline TED English-French translation task experiments


In both cases, a gain of ≈1 BLEU point is obtained. Intuitively,
by using relative counts, the new approach allows
more refined computation of the λ used to compute the
interpolated/adapted probability for each phrase. This method
avoids overweighting the gp model when both the iwslt and
gp models have relatively few counts.

For  these  experiments,  the  reported  scores  are  an  aver¬ 
age  of  five  optimization/decoding  runs  with  different  ran¬ 
dom  weight  initializations.  Note  that  both  variants  of  our 
phrase  table  adaptation  result  in  gains  over  language  model 
interpolation  alone.  The  use  of  a  corpus-based  distance  mea¬ 
sure  in  addition  to  the  standard  MAP  approach  results  in  a 
small (≈0.2-0.4) BLEU gain on the supplied tst2010 data
set, but the results on the tst2011 data set do not show significant
differences. This could be due to mismatch between
dev2010 (which was used to compute corpus weights for
interpolation) and tst2011. More experiments will be
needed  to  explore  this  performance  gap. 

7.3.  Arabic  Morphology  Experiments 

We  evaluated  the  translation  results  from  the  modified 
CoMMA  processes,  as  described  above,  for  the  Arabic-to- 
English translation task at the aforementioned threshold levels.
Table 7 shows the mean BLEU scores (over ten optimization
runs) on the Arabic tst2010 development data set
when applying CoMMA. These systems were then compared
to a baseline using the unmodified CoMMA procedure as described
in last year’s system.

The revised IWSLT11 CoMMA process did not consistently
outperform the standard CoMMA process by a significant
BLEU margin, though at most threshold points
there was a slight gain. We would have expected the revised
normalization to allow for more consistent Arabic
phrase extraction, but this did not result in large BLEU score
gains, perhaps due to the relatively large training set
available.

8.  Evaluation  Summary 

As  part  of  this  year’s  evaluation  we  experimented  with  im¬ 
proved  cross-domain  adaptation,  improved  Arabic  morpho¬ 
logical  processing  and  refinements  to  our  multiple  MT  com¬ 
bination  approach.  These  developments  have  helped  to  im¬ 
prove  our  system  when  compared  with  our  2010  baseline. 

CoMMA Threshold    Mean BLEU
                   CoMMA (Old)    CoMMA (New)
1,000              20.40          20.27
2,000              20.18          20.26
5,000              20.33          20.44
10,000             20.23          20.44
20,000             21.06          22.18
50,000             21.52          22.10

Table 7: Mean BLEU scores for IWSLT10 and IWSLT11
CoMMA systems versus threshold for the Arabic tst2010

Table 8 summarizes each of the systems submitted for
this year’s evaluation and how they compare with our 2010
baselines (when applicable) on the tst2011 data set.